Title Similarity-Based Feature Weighting for Text Categorization
Authors
Abstract
In automated text categorization, a system analyzes a natural-language document to decide whether it belongs in one or more of a set of predefined categories. The typical approach is to represent the documents as feature vectors and to inductively generate a classifier from a training set of documents and their manually assigned categories. Such a process ignores word order, syntax, and other heuristics that might help identify good features for categorization. Recently, more attention has been paid to using deeper natural language processing techniques to improve the performance of the standard classifiers. One such approach, which takes advantage of a previously generated thesaurus of lexical similarities, is studied in this project. The system identifies keywords in the text by looking for terms with high similarity to the terms in the title field. A database of automatically clustered, dependency-based word similarities is used to identify the similar words. Experiments show that increasing the weight of these key terms improves the effectiveness of text categorization for a number of topics in the standard Reuters newswire corpus.
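The title-similarity weighting described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy thesaurus, the similarity threshold, the multiplicative boost, the whitespace tokenization, and the names `similarity_to_title` and `weight_terms` are all assumptions made for the example, since the abstract does not specify the weighting formula.

```python
from collections import Counter

# Hypothetical similarity thesaurus: word -> {similar word: score in [0, 1]}.
# The paper uses a database of dependency-based word similarities; these
# entries are invented for illustration only.
TOY_THESAURUS = {
    "merger": {"acquisition": 0.8, "takeover": 0.7},
    "oil": {"crude": 0.9, "petroleum": 0.85},
}


def similarity_to_title(term, title_terms, thesaurus):
    """Return the highest similarity between `term` and any title term."""
    if term in title_terms:
        return 1.0  # a title term is trivially similar to the title
    best = 0.0
    for title_term in title_terms:
        # Look the pair up in both directions, since the toy thesaurus
        # is not guaranteed to be symmetric.
        forward = thesaurus.get(title_term, {}).get(term, 0.0)
        backward = thesaurus.get(term, {}).get(title_term, 0.0)
        best = max(best, forward, backward)
    return best


def weight_terms(title, body, thesaurus, threshold=0.5, boost=2.0):
    """Weight body terms by frequency, boosting those similar to the title.

    `threshold` and `boost` are assumed parameters, not values from the paper.
    """
    title_terms = set(title.lower().split())
    counts = Counter(body.lower().split())
    weights = {}
    for term, tf in counts.items():
        is_key = similarity_to_title(term, title_terms, thesaurus) >= threshold
        weights[term] = tf * (boost if is_key else 1.0)
    return weights


weights = weight_terms(
    "Oil merger talks",
    "crude prices rose as takeover talks continued and crude exports fell",
    TOY_THESAURUS,
)
for term in ("crude", "takeover", "talks", "prices"):
    print(term, weights[term])
```

In this sketch the resulting weights would replace raw term frequencies in the feature vector passed to a standard classifier; a multiplicative boost is only one of several plausible ways to increase the weight of key terms.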
Similar resources
Feature Weighting Improvement of Web Text Categorization Based on Particle Swarm Optimization Algorithm
Structures such as the title often express the main content of a text, and these structures may influence the effectiveness of text categorization. However, the most common feature weighting algorithm, term frequency-inverse document frequency (TF-IDF), does not take the structural information of texts into account. To solve this problem, a new feature weighting al...
Enriched Format Text Categorization Using A Component Similarity Approach
Text categorization has been widely studied for years. However, conventional categorization approaches that work well on plain text perform poorly when they are simply applied to enriched format texts. A categorization approach that is applicable to enriched format text is proposed. During feature selection, we get feature structure distribution weight by using extended structure mode...
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
The Analysis and Optimization of KNN Algorithm Space-Time Efficiency for Chinese Text Categorization
The performance of any text classification algorithm is reflected in the reliability of its classification results and in the efficiency of the algorithm. We analyze the space-time efficiency of the different stages of the traditional KNN algorithm for Chinese text classification while ensuring the reliability of classification. And we optimize efficiency of the algorithm and the...
Compherensive Review Of Text Classification Using Machine Learning
Text classification, also known as text categorization, is the task of automatically assigning unlabeled documents to one or more predefined categories or classes. The ability to perform a classification task accurately depends on the representations of the documents to be classified. Text representations transform the textual docum...
Publication date: 2004